SHORTER ARTICLES AND DISCUSSION

THE ARITHMETIC OF THE PRODUCT MOMENT
METHOD OF CALCULATING THE COEFFICIENT
OF CORRELATION

In view of the increasing application of refined statistical 
methods to scientific problems of all kinds it seems highly de- 
sirable to point out any simplification of arithmetical method 
which may lighten the necessarily formidable labor of calcula- 
tion. 

The statistical constant which has proved a most powerful
tool in many fields of work is the coefficient of correlation.
Besides the contingency constant, 1 the correlation ratio 2 and the
well-known four-fold table and product moment methods, several
alternative processes of determining correlation in the case
of special data have been suggested. 3 In recent numbers of
Science Dr. Boas 4 and Professor Pearson 5 have discussed the
formulae to be used in the calculation of r, and in another place 6
I have suggested methods which in some cases materially
shorten the labor of calculation.

When the nature of the material permits, the best method of 
calculating correlation is the product moment one. 

As conventionally described in the books, 7 this requires for the

1 Pearson, K., "On Contingency and its Relation to Association and
Normal Correlation," Draper's Co. Research Memoirs, Biometric Series, 1, 1904.

2 Pearson, K., "On the General Theory of Skew Correlation and
Non-linear Regression," ibid., 2, 1905.

3 For instance, K. Pearson, "On Further Methods of Determining
Correlation," ibid., 4, 1907; K. Pearson, "On a New Method of Determining
Correlation," Biometrika, Vol. VII, pp. 248-257, 1910; K. Pearson, "On
a New Method of Determining Correlation," Biometrika, Vol. VII, pp.
96-105, 1909.

4 Boas, F., "Determination of the Coefficient of Correlation," Science,
N. S., Vol. XXIX, pp. 823-824, 1909.

5 Pearson, K., "Determination of the Coefficient of Correlation," Science,
N. S., Vol. XXX, pp. 23-25, 1909.

6 Harris, J. Arthur, "A Short Method of Calculating the Coefficient of
Correlation in the Case of Integral Variates," Biometrika, Vol. VII, pp.
214-218, 1910.

7 Yule, G. U., "On the Theory of Correlation," Jour. Roy. Stat. Soc.,
Vol. LX, p. 812; Davenport, C. B., "Statistical Methods"; Elderton, W. P.,
"Frequency Curves and Correlation."
calculation of S(xy): (a) the writing down on the margin of the
correlation table of an assumed mean, V, of both the x and the
y characters, and the plus and minus deviations of the different
grades from these origins; (b) the entering in the body of the
table of the products of the deviations of the several classes of x
and y from their respective assumed means, care being taken to
regard sign; (c) the summation of the products of the class
frequencies of x and y by the first two powers of their deviations
from their assumed means; (d) the multiplication of the
frequencies in the table by the products of the deviation of x and y,
and the summation of their products, with regard to signs.
This gives

$S(x')$, $S(x'^2)$, $S(y')$, $S(y'^2)$, $S(x')/N = d_x$, $S(x'^2)/N = \nu'_{2x}$,
$S(y')/N = d_y$, $S(y'^2)/N = \nu'_{2y}$, and $S(x'y')$,

from which we may obtain the moments and the product moments
about their true means by the use of the formulae:

$\sigma_x^2 = \nu'_{2x} - d_x^2$,
$\sigma_y^2 = \nu'_{2y} - d_y^2$,
$S(xy) = S(x'y') - N d_x d_y$,
$r = S(xy) / N \sigma_x \sigma_y$.
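The reduction just described can be sketched in a few lines of Python. The function names `r_from_sums` and `sums_about`, their argument names, and the short series `xs`, `ys` are illustrative assumptions and are not taken from the text; the arithmetic follows the formulae above. The sketch also shows that the choice of assumed origin does not affect r.

```python
import math

def r_from_sums(n, sx, sxx, sy, syy, sxy):
    """Product-moment r from sums taken about any assumed origin.

    n: N;  sx: S(x');  sxx: S(x'^2);  sy: S(y');  syy: S(y'^2);  sxy: S(x'y')
    """
    d_x = sx / n                                 # d_x = S(x')/N
    d_y = sy / n                                 # d_y = S(y')/N
    sigma_x = math.sqrt(sxx / n - d_x ** 2)      # sigma_x^2 = nu'_2x - d_x^2
    sigma_y = math.sqrt(syy / n - d_y ** 2)
    s_xy_true = sxy - n * d_x * d_y              # S(xy) about the true means
    return s_xy_true / (n * sigma_x * sigma_y)

def sums_about(xs, ys, origin_x, origin_y):
    """Sums of deviations, squares and products about the given origins."""
    dx = [x - origin_x for x in xs]
    dy = [y - origin_y for y in ys]
    return (len(xs),
            sum(dx), sum(v * v for v in dx),
            sum(dy), sum(v * v for v in dy),
            sum(a * b for a, b in zip(dx, dy)))

# Hypothetical short series, used only to exercise the function.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

r_origin_zero = r_from_sums(*sums_about(xs, ys, 0, 0))
r_origin_near_mean = r_from_sums(*sums_about(xs, ys, 3, 4))
print(r_origin_zero, r_origin_near_mean)
```

Both calls should agree to within rounding, since the correction term $N d_x d_y$ removes the effect of the origin.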

Frequently there are several possible ways of carrying out the 
arithmetic for a given formula, and the one chosen is largely de- 
pendent on the mental and mechanical traits of the computer. 
Personally I have found the process described rather cumber- 
some, and in using it — especially through the hands of assist- 
ants — have found slips coming into the work with unfortunate 
frequency. 

The chief difficulty lies in the fact that the products of the 
deviations of x and y from their assumed means must be written 
down as indices to the frequencies in the body of the table itself. 
This requires considerable time if the table be at all large, and 
there is no way of checking the results for blunders except to 
go over the entire process again. After this has been done, all 
the multiplications of the frequencies into their indices must be 
carried out and the products summed. If the ranges of variation 
be at all large, say even no higher than 20 classes each, one may 
have to multiply some frequencies by such products as 63, 64, 
72, 81, etc., and the labor becomes rather great. 

Now since the formulae given above enable us to refer the
moments or the product moments taken about any arbitrary
point or origin to the true means, I have found the following
method to result in a material saving of time and energy.

If for y arrays of integral variates 8 we take 0, instead of the
grade thought to be nearest the mean, as the origin, we at once
obtain the total for the array by summing the products of the
frequencies by their grades. By multiplying each total by the grade
of the x character and summing, we at once obtain S(x'y').
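The step may be written out as a short identity. The symbols $f_{xy}$ (frequency of the compartment with grades x and y) and $T_x$ (total of the y array for grade x) are introduced here for illustration and do not occur in the text; the origin is assumed to be taken at 0 for both characters, so that $x' = x$ and $y' = y$.

```latex
\begin{align*}
T_x     &= \sum_{y} y \, f_{xy}
          && \text{(total of the $y$ array for grade $x$)} \\
S(x'y') &= \sum_{x} \sum_{y} x \, y \, f_{xy}
         = \sum_{x} x \, T_x
          && \text{(each array total multiplied by its $x$ grade)} \\
S(y')   &= \sum_{x} T_x
          && \text{(a check on the array totals)}
\end{align*}
```

No product of deviations has to be entered in the body of the table; only one total per array is required.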

I illustrate by a table showing the relationship between the
number of ovules formed and the number of seeds developing
per locule for a series of pods of Hibiscus Syriacus taken in the
Missouri Botanical Garden in the autumn of 1907. Table I gives
the data.

TABLE I

Ovules per                        Seeds per locule (y)
locule (x)      0      1      2      3      4      5      6      7      8      9    Total

     3          1      1      1      -      -      -      -      -      -      -        3
     4         12     12     21     17      5      -      -      -      -      -       67
     5         63    115    127    146    153     83      -      -      -      -      687
     6        732  1,148  1,415  1,598  1,865  1,829  1,212      -      -      -    9,799
     7        208    350    450    525    635    690    714    433      -      -    4,005
     8        112    208    300    244    326    368    395    255    151      -    2,359
     9          1      5      5      5      2      5      3      9      5      5       45

   Total    1,129  1,839  2,319  2,535  2,986  2,975  2,324    697    156      5   16,965


Ovules per locule = x, seeds per locule = y 

The total number of seeds developing for each ovule-class is
found with great ease and rapidity by multiplying the frequencies
by the number of seeds per locule and summing at the same
time on the machine. 9 This gives Table II.

TABLE II

Ovule Class.    Frequency.    Total Seeds.

     3                3               3
     4               67             125
     5              687           1,834
     6            9,799          32,649
     7            4,005          16,130
     8            2,359          10,047
     9               45             229

  Totals         16,965          61,017

By multiplying the totals of seeds by the number of ovules in the
locules in which they were produced and summing at the same time
we find S(x'y') = 400,920.
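The passage from Table I to Table II and thence to S(x'y') may be sketched as follows. The names `OVULE_CLASSES`, `TABLE_I`, `array_totals` and `s_xy` are illustrative. The seed classes are assumed to run from 0 to 9, which is the reading under which the marginal totals of Table I and the totals of Table II agree; empty compartments are entered as zero.

```python
# Rows: ovule classes x = 3..9.  Columns: seed classes y = 0..9 (assumed).
OVULE_CLASSES = [3, 4, 5, 6, 7, 8, 9]
TABLE_I = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [12, 12, 21, 17, 5, 0, 0, 0, 0, 0],
    [63, 115, 127, 146, 153, 83, 0, 0, 0, 0],
    [732, 1148, 1415, 1598, 1865, 1829, 1212, 0, 0, 0],
    [208, 350, 450, 525, 635, 690, 714, 433, 0, 0],
    [112, 208, 300, 244, 326, 368, 395, 255, 151, 0],
    [1, 5, 5, 5, 2, 5, 3, 9, 5, 5],
]

# Frequency of each ovule class (second column of Table II).
frequencies = [sum(row) for row in TABLE_I]

# Total seeds for each ovule class: frequencies multiplied by the
# seed grade and summed (third column of Table II).
array_totals = [sum(y * f for y, f in enumerate(row)) for row in TABLE_I]

# S(x'y'): each array total multiplied by its ovule grade, then summed.
s_xy = sum(x * t for x, t in zip(OVULE_CLASSES, array_totals))

print(frequencies)
print(array_totals)
print(s_xy)
```

The sum of `array_totals` is S(y'), which gives the check on the work mentioned later in the text.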

8 The method is applicable to graduated variates as well, since in the 
process of calculation the centers of the classes are taken as integral. 
9 I use a Brunsviga. A comptometer will serve well for this.
To complete the calculation of the correlation $r_{xy}$, we require
only $A_x$, $A_y$, $\sigma_x$, $\sigma_y$. Taking the origin at 0 for both of these
characters 10 and multiplying and summing at the same time, as
we did before, we get:

$S(x') = 109{,}818$, $S(x'^2) = 721{,}904$,

$S(y') = 61{,}017$, $S(y'^2) = 284{,}287$,

$S(x')/N = 109{,}818/16{,}965 = 6.4732 = A_x$,

$S(y')/N = 61{,}017/16{,}965 = 3.5966 = A_y$,

$\sigma_x = \sqrt{S(x'^2)/N - A_x^2} = .80629$,

$\sigma_y = \sqrt{S(y'^2)/N - A_y^2} = 1.95485$,

$r_{xy} = [S(x'y')/N - A_x A_y] / \sigma_x \sigma_y$.
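The completion of the calculation for the Hibiscus series may be sketched thus. The dictionary names `FREQ_X` and `FREQ_Y` and the helper `power_sum` are illustrative; the frequencies are the marginal totals of Table I, with the seed classes assumed to run from 0 to 9. The value of r printed by the sketch is computed here and is not quoted from the text.

```python
import math

N = 16965
# Marginal frequencies of Table I; origin taken at 0 for both characters.
FREQ_X = {3: 3, 4: 67, 5: 687, 6: 9799, 7: 4005, 8: 2359, 9: 45}
FREQ_Y = {0: 1129, 1: 1839, 2: 2319, 3: 2535, 4: 2986,
          5: 2975, 6: 2324, 7: 697, 8: 156, 9: 5}
S_XY = 400920  # S(x'y') as found from the array totals

def power_sum(freq, power):
    """Sum of frequency times grade raised to the given power."""
    return sum(f * grade ** power for grade, f in freq.items())

sx, sxx = power_sum(FREQ_X, 1), power_sum(FREQ_X, 2)   # S(x'), S(x'^2)
sy, syy = power_sum(FREQ_Y, 1), power_sum(FREQ_Y, 2)   # S(y'), S(y'^2)

a_x, a_y = sx / N, sy / N                               # the two means
sigma_x = math.sqrt(sxx / N - a_x ** 2)
sigma_y = math.sqrt(syy / N - a_y ** 2)

r_xy = (S_XY / N - a_x * a_y) / (sigma_x * sigma_y)
print(sx, sxx, sy, syy)
print(round(a_x, 4), round(a_y, 4), round(sigma_x, 5), round(sigma_y, 5))
print(r_xy)
```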

Now I think, notwithstanding these large numbers, 11 there has 
been a material gain in facility of calculation. The writing 
down of the indices showing the products of deviation has been 
entirely avoided. The direct calculation of the totals of arrays 
is easy, and it is only one short step further to obtain the means 
of arrays for testing linearity, either by the sensible agreement of
empirical and theoretical means, as shown by a graph, or by the
application of Blakeman's test to $\eta^2 - r^2$. Note also that in the
conventional method of obtaining S(x'y') there is no way of 
checking the work except to go over it independently. In the 
method here described the larger part of the work of obtaining 
S(x'y') is at once checked by the fact that the sum of the totals 
of the y arrays = S(y'). The final multiplication and summa- 
tion is very quickly verified. 
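The check and the further step to the means of arrays may be sketched as follows, using the figures of Table II. The variable names are illustrative, and the ovule classes 3 to 9 are assumed for the rows of that table.

```python
OVULE_CLASSES = [3, 4, 5, 6, 7, 8, 9]
FREQUENCY = [3, 67, 687, 9799, 4005, 2359, 45]
TOTAL_SEEDS = [3, 125, 1834, 32649, 16130, 10047, 229]
S_Y = 61017  # S(y') found independently from the seed marginal totals

# Check: the totals of the y arrays must sum to S(y').
check_ok = sum(TOTAL_SEEDS) == S_Y

# Mean number of seeds per locule for each ovule class (means of arrays),
# the empirical values to be set against the regression line.
array_means = {
    x: total / freq
    for x, freq, total in zip(OVULE_CLASSES, FREQUENCY, TOTAL_SEEDS)
}

print(check_ok)
for x, mean in array_means.items():
    print(x, round(mean, 4))
```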

There are advantages in the method beyond these indicated by 
this illustration. 

Suppose that one wishes to correlate between a first character 
and a number of repeated characters and their sum. This some- 
times happens in work on fertility. Take as an illustration a 
table 12 showing the relationship between the length of the fruit 
and the number of ovules on the two placentae in the bloodroot, 

10 Where the range is great it may pay to use the conventional method 
in calculating the standard deviations. Of course it is quite immaterial for 
the method of calculating r suggested here, how the constants of the two 
variables are deduced. 

11 I have purposely chosen an illustration giving large numbers. The
series of observations with which we are dealing here is larger than is
generally available in biological work.

12 Table IV, 1907, Biometrika, Vol. VII, p. 335, 1910.
TABLE III

Length.    Number of Fruits.    Total Ovules.

 21-23             2                   19
 24-26             8                   94
 27-29            56                  636
 30-32           162                1,922
 33-35           336                4,442
 36-38           366                5,106
 39-41           318                4,590
 42-44           308                4,868
 45-47           208                3,615
 48-50           142                2,564
 51-53            50                  965
 54-56            20                  442
 57-59            18                  458
 60-62             4                  119
 63-65             2                   64

 Total         1,000               29,904


Sanguinaria Canadensis. Table III gives the frequency and the
total numbers of ovules produced for each length-class.

$S(x'y') = 1{,}217{,}367$. 13

We now calculate the correlation for the length and ovules per
placenta or length and total ovules at pleasure. In the first case
we have

$r = \frac{(1{,}217{,}367/2000) - 39.703 \times 14.952}{6.30704 \times 4.61361} = .517 \pm .016$,

and in the second

$r = \frac{(1{,}217{,}367/1000) - 39.703 \times 29.904}{6.30704 \times 9.05447} = .527 \pm .015$.
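The two coefficients may be reproduced from Table III in a short sketch. The class midpoints 22, 25, ..., 64 are assumed as the grades of length; with them the product sum agrees with the printed value. The standard deviations are those quoted in the text, since they cannot be deduced from Table III alone. The frequencies as printed sum to 2,000, which matches the divisor used in the first case. All names are illustrative.

```python
MIDPOINTS = list(range(22, 65, 3))  # centers of classes 21-23, ..., 63-65
COUNTS = [2, 8, 56, 162, 336, 366, 318, 308, 208, 142, 50, 20, 18, 4, 2]
OVULES = [19, 94, 636, 1922, 4442, 5106, 4590, 4868,
          3615, 2564, 965, 442, 458, 119, 64]

# Standard deviations as quoted in the text.
SIGMA_LENGTH = 6.30704
SIGMA_PER_PLACENTA = 4.61361
SIGMA_TOTAL_OVULES = 9.05447

# S(x'y'): total ovules of each length class times the class center.
s_xy = sum(m * t for m, t in zip(MIDPOINTS, OVULES))

# Mean length from the printed frequencies.
mean_length = sum(m * c for m, c in zip(MIDPOINTS, COUNTS)) / sum(COUNTS)

def r_for(n, sigma_ovules):
    """r between length and ovules for n entries and the given ovule s.d."""
    mean_ovules = sum(OVULES) / n
    return (s_xy / n - mean_length * mean_ovules) / (SIGMA_LENGTH * sigma_ovules)

r_per_placenta = r_for(2000, SIGMA_PER_PLACENTA)   # length and ovules per placenta
r_total_ovules = r_for(1000, SIGMA_TOTAL_OVULES)   # length and total ovules

print(s_xy, mean_length)
print(round(r_per_placenta, 3), round(r_total_ovules, 3))
```

The same product sum serves both cases; only N, the mean and the standard deviation of the ovule character change.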



Again, a worker may have the means and variabilities for two
characters p and q, and wish to obtain the correlation without
the trouble of preparing a formal table. He simply seriates his cards
according to the p character, sums the values of the associated q
characters and writes down grades of p and the associated totals
in a table like II or III. The remainder of the work is as
illustrated above.
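The seriation of cards may be sketched as a grouping operation. The record cards below are hypothetical and serve only to show the procedure; the function names `totals_by_grade` and `product_sum` are likewise illustrative.

```python
from collections import defaultdict

def totals_by_grade(cards):
    """Seriate (p, q) cards by p and sum the associated q values."""
    totals = defaultdict(int)
    for p, q in cards:
        totals[p] += q
    return dict(sorted(totals.items()))

def product_sum(totals):
    """S(p'q') with the origin at 0: each total times its grade of p."""
    return sum(p * t for p, t in totals.items())

# Hypothetical record cards, each bearing a (p, q) pair.
cards = [(2, 5), (3, 7), (2, 6), (4, 9), (3, 8)]

table = totals_by_grade(cards)   # grades of p with associated totals of q
s_pq = product_sum(table)
print(table, s_pq)
```

The result equals the sum of the individual products p times q, which is why no formal correlation table is needed.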

Or, finally, in short series r may be very quickly calculated by 
summing the product of the values of the two characters of the 
individuals, dividing by the total number, subtracting the prod- 
uct of the two means, and dividing the result by the product of 
the two standard deviations.

13 This number looks rather formidable, but it is read directly from the
Brunsviga and can be verified in two or three minutes. Had we centered 
the length at class 36-38 and the number of ovules at 20, and worked by the 
conventional method we should have had to calculate and write on the cor- 
relation surface nearly 200 of the products of the deviations of the x and 
y characters from their assumed means. Some of these are rather large and 
there is no way of checking the work except to go over it independently. 
All this is obviated by the method suggested here. 
The best illustration I have is from Apstein's 14 work on Mysis.

Table IV shows the body length in millimeters and the number 
of eggs and embryos for a series of 52 individuals taken Feb- 
ruary, 1904, and Table V the same characters for a series of 24 
taken at all other times in 1904 and 1905. 



TABLE IV

Length and    Length and    Length and    Length and
  Eggs.         Eggs.         Eggs.         Eggs.

  13-9          16-17         18-23         20-66
  14-9          16-21         18-24         21-26
  15-11         16-23         18-35         21-34
  15-13         17-16         18-36         21-48
  15-13         17-16         19-26         21-51
  15-16         17-16         19-40         21-51
  16-13         17-18         19-45         21-58
  16-14         17-19         20-30         21-62
  16-14         17-23         20-39         22-24
  16-16         17-26         20-48         22-47
  16-17         17-28         20-49         22-68
  16-17         18-22         20-51         22-71
  16-17         18-22         20-58         23-67

TABLE V

Length and    Length and
  Eggs.         Eggs.

  12-27         19-25
  14-20         19-26
  14-21         19-35
  14-28         19-51
  15-11         19-77
  15-17         20-16
  15-45         20-27
  16-11         20-47
  17-12         21-43
  17-24         22-15
  18-13         23-36
  18-17
  18-21




The range of both characters is large. If one were to prepare
a regular correlation table (omitting all unnecessary columns
and rows) he would have tables of 341 and 222 compartments
for the 52 and 24 observations!

Multiplying out with the help of the first two pages of
Barlow's tables for the higher squares, we find for the first series:
$S(x') = 946$; $S(x'^2) = 17{,}516$; $S(y') = 1{,}623$; $S(y'^2) = 67{,}089$;
$S(x'y') = 31{,}432$; $N = 52$, whence

$r = [S(x'y')/N - 18.1923 \times 31.2115] / (2.4261 \times 17.7768) = .85$.

For the second series $S(x') = 424$; $S(x'^2) = 7{,}672$; $S(y') = 665$;
$S(y'^2) = 24{,}149$; $S(x'y') = 12{,}018$; $N = 24$, whence

$r = [S(x'y')/N - 17.6667 \times 27.7083] / (2.7487 \times 15.4420) = .26$.
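The short-series calculation may be sketched directly from the individual observations of Tables IV and V. The function name `r_short_series` is illustrative; the steps are those described for short series: sum the products, divide by N, subtract the product of the means, and divide by the product of the standard deviations.

```python
import math

# (length in mm, eggs and embryos) for each individual.
TABLE_IV = [
    (13, 9), (14, 9), (15, 11), (15, 13), (15, 13), (15, 16), (16, 13),
    (16, 14), (16, 14), (16, 16), (16, 17), (16, 17), (16, 17),
    (16, 17), (16, 21), (16, 23), (17, 16), (17, 16), (17, 16), (17, 18),
    (17, 19), (17, 23), (17, 26), (17, 28), (18, 22), (18, 22),
    (18, 23), (18, 24), (18, 35), (18, 36), (19, 26), (19, 40), (19, 45),
    (20, 30), (20, 39), (20, 48), (20, 49), (20, 51), (20, 58),
    (20, 66), (21, 26), (21, 34), (21, 48), (21, 51), (21, 51), (21, 58),
    (21, 62), (22, 24), (22, 47), (22, 68), (22, 71), (23, 67),
]
TABLE_V = [
    (12, 27), (14, 20), (14, 21), (14, 28), (15, 11), (15, 17), (15, 45),
    (16, 11), (17, 12), (17, 24), (18, 13), (18, 17), (18, 21),
    (19, 25), (19, 26), (19, 35), (19, 51), (19, 77), (20, 16), (20, 27),
    (20, 47), (21, 43), (22, 15), (23, 36),
]

def r_short_series(pairs):
    """Product-moment r from individual (x, y) values, origin at 0."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sd_x = math.sqrt(sum(x * x for x, _ in pairs) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(y * y for _, y in pairs) / n - mean_y ** 2)
    mean_product = sum(x * y for x, y in pairs) / n
    return (mean_product - mean_x * mean_y) / (sd_x * sd_y)

r_first = r_short_series(TABLE_IV)
r_second = r_short_series(TABLE_V)
print(round(r_first, 2), round(r_second, 2))
```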

The professional statistician will note that mathematically 
there is nothing novel in the methods suggested. But I have

14 Apstein, C., "Lebensgeschichte von Mysis mixta Lillj.,"
Wissenschaftliche Meeresuntersuchungen, Abteilung Kiel, N. F., Bd. IX, s. 241-260, 1906.

I give Apstein's data merely as an illustration of a method of
calculation. Not only are the numbers too small for biological conclusions of much
value, but the series are more or less heterogeneous, coming as they do from
several stations. Apstein's own conclusion from his data is "Dass die Zahl
der Embryonen nicht so von Salzgehalt als von Grösse der Mutter abhängig
ist" (that the number of embryos depends less on the salinity than on the
size of the mother).
found them a great service, and since I have never seen them in
use 15 by other calculators it seems worth while to direct attention
to them.

J. Arthur Harris. 
Cold Spring Harbor, L. I., 
August 27. 

15 In some respects Hardy's summation method described in detail by
Elderton in "Frequency Curves and Correlation" is similar, but I believe
that the one here proposed requires less labor.



